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Abstract:  Recent  developments  in  sensor  technology,  signal  processing,  and  communica¬ 
tions  have  enabled  the  conception  and  deployment  of  large-scale  networked  sensing  systems 
consisting  of  coordinated  stationary  and  mobile  platforms  carrying  sensors  of  diverse  modal¬ 
ities.  The  promise  of  systems  lies  in  their  ability  to  intelligently  integrate  information  from 
massive  amounts  of  sensor  data.  At  the  core  of  these  challenges  is  a  fundamental  information 
fusion  task. 

The  research  performed  under  this  grant  served  to  address  the  modeling,  algorithmic 
and  theoretical  challenges  associated  with  these  problems  of  large-scale  information  fusion. 
Significant  accomplishments  include  (a)  the  development  of  message-passing  algorithms  for 
distributed  optimization  and  statistical  inference,  with  applications  to  sensor  fusion  and  com¬ 
puter  vision;  (b)  the  formulation  and  analysis  of  convex  relaxations  for  estimating  low-rank 
matrices  from  data,  with  applications  to  missing  data  problems,  and  tracking  of  dynamical 
systems;  (c)  the  development  of  non-parametric  methods  for  solving  high-dimensional  predic¬ 
tion  problems;  and  (d)  analysis  and  implementation  of  methods  for  selecting  graphical  models 
in  high  dimensions,  with  applications  to  terrorist  cell  monitoring,  and  social  network  analysis. 
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1  Summary 


In  this  final  report,  we  summarize  research  activity  associated  with  AFOSR  grant  FA9550-09-1-0466. 
Overall,  the  grant  was  used  to  support  the  co-PI  Prof.  Martin  Wainwright,  as  well  as  grad¬ 
uate  students  Sahand  Negahban  (Ph.D  2012,  now  an  assistant  professor  at  Yale  Univer¬ 
sity),  Garvesh  Raskutti  (Ph.  D  2012,  now  an  assistant  professor  at  University  of  Wisconsin, 
Madison),  Alekh  Agarwal  (Ph.D.  2012,  now  research  scientist  at  Microsoft  Research),  and 
Po-Ling  Loh  (Ph.D.  expected  in  2014).  The  grant  has  lead  to  the  refereed  conference  pa¬ 
pers  [26,  24,  33,  13],  as  well  as  the  journal  publications  [36,  2,  1,  34,  14,  37,  35,  25]. 


2  Honors  and  awards 

During  the  grant  period,  Professor  Martin  Wainwright  received  an  IEEE  Communications 
Society  Best  Paper  Award  (2010),  a  Joint  IEEE  Information  Theory  and  Communications 
Best  Paper  Award  (2012),  and  a  Medallion  Lecturership  from  the  Institute  of  Mathematical 
Statistics  (2013).  Graduate  Student  Po-Ling  Loh  received  a  Best  Paper  Award  from  the  NIPS 
Conference  in  December  2012. 


3  Estimation  of  low-rank  matrices  in  high  dimensions 

In  the  papers  [33,  34],  we  study  the  problem  of  estimating  a  matrix  0*  €  RpiXp2  that  is 
either  exactly  low  rank,  meaning  that  it  has  at  most  r  <C  min{pi,p2}  non-zero  singular 
values,  or  more  generally  is  near  low-rank,  meaning  that  it  can  be  well-approximated  by  a 
matrix  of  low  rank.  Such  exact  or  approximate  low-rank  conditions  are  appropriate  for  many 
applications,  including  multivariate  or  multi-task  forms  of  regression,  system  identification  for 
autoregressive  processes,  collaborative  filtering,  and  matrix  recovery  from  random  projections. 
Analogous  to  the  use  of  an  £i-regularizer  for  enforcing  sparsity,  we  consider  the  use  of  the 
nuclear  norm  (also  known  as  the  trace  norm)  for  enforcing  a  rank  constraint  in  the  matrix 
setting.  By  definition,  the  nuclear  norm  is  the  sum  of  the  singular  values  of  a  matrix,  and 
so  encourages  sparsity  in  the  vector  of  singular  values,  or  equivalently  for  the  matrix  to  be 
low-rank. 

One  motivation  for  our  work  is  the  problem  of  recovering  system  matrices  in  vector  au- 
togressive  (VAR)  processes  [28].  A  VAR  model  consists  of  a  sequence  {X(t)}^L1,  where  each 
X(t)  6  Rp  is  a  vector  of  state  variables,  that  evolves  according  to  the  recursion 

X(t  +  l)  =  e*X(t)  +  W(t),  t  =  1,2,3,---  , 

where  W[t)  £  are  driving  noise  terms.  Such  models  are  widely  used  in  different  settings. 
They  are  integral  parts  of  subspace  tracking  models  in  signal  processing,  motion  models  models 
in  computer  vision,  financial  data  analysis,  and  neural  data  analysis  (e.g.,  [15,  5,  10,  10]).  This 
model  and  closely  related  ones  also  arise  in  the  problem  of  collaborative  filtering  [41],  in  which 
the  goal  is  to  predict  users’  preferences  for  items  (such  as  movies  or  music)  based  on  their 
and  other  users’  ratings  of  related  items.  The  system  matrix  0*  £  MPxp  is  unknown,  and  our 
goal  is  to  estimate  it,  using  a  number  of  samples  N  less  than  the  dimension  of  the  problem. 
Imposing  a  rank  r  constraint  on  a  matrix  0*  £  MpiXpa  is  equivalent  to  requiring  the  rows  (or 
columns)  of  0*  lie  in  some  r-dimensional  subspace  of  MP2  (or  MPl  respectively) . 
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Figure  1.  (a)  Four  entries  of  a  p  =  100  dimension  vector  autoregressive  (VAR)  process, 
generated  from  a  system  matrix  0*  £  Rpxp  with  rank  r  =  3.  Every  component  is  a  mixture 
of  the  r  =  3  signal  components  with  p  —  r  =  97  noise  components,  (b)  Data  that  has  been 
“de-mixed”  using  the  learned  model  0:  the  first  three  components  (in  red)  are  estimates  of  the 
signal,  whereas  the  remaining  blue  component  is  pure  noise.  Note  how  the  signal  components 
are  much  smoother  than  the  noise  component. 


In  order  to  recover  low-rank  matrices,  we  propose  solving  the  following  semidefinite  pro¬ 
gram  (SDP), 

0<Earg  min  ~  3tjv(©)|||  +  Ajv|||0|||i},  (1) 

©gffiPl  xp2  l  Zl\  ) 


where  A tv  >  0  is  a  regularization  parameter.  Here  Xj v  :  MPlXp2  — y  is  an  operator  that  maps 

matrices  to  IV-vectors  of  observations.  This  optimization  problem  can  be  solved  efficiently  by 
various  techniques,  and  we  our  main  result  is  to  prove  upper  bounds  on  the  Frobenius  norm 
I©  —  ©*|||f  between  the  true  matrix  0*  and  the  estimate  0.  In  particular,  we  prove  that 
under  mild  conditions,  the  Frobenus  norm  error  satisfies  the  bound 


lie  -  e*|J.  =  0(!k L±rd) 


(2) 


with  probability  greater  than  1  —  ci  exp(— C2NX2n).  This  bound  implies  that  it  is  possible 
to  obtain  a  good  estimate  of  the  matrix  using  far  fewer  than  pi  p2  samples  as  long  as  the 
rank  r  is  small.  These  theoretical  results  provide  a  remarkably  good  characterization  of  the 
high- dimensional  scaling  of  this  method;  see  Figure  2  for  some  illustrative  results. 


4  Dual  Averaging  for  Distributed  Optimization 


We  consider  an  optimization  problem  based  on  functions  that  are  distributed  over  a  network. 
More  specifically,  let  G  =  ( V. ,  E )  be  an  undirected  graph  over  the  vertex  set  V  =  {1,2, ...  ,n} 
with  edge  set  E  C  V  x  V.  Associated  with  each  i  £  V  is  convex  function  fi  :  IRrf  — >  M,  and 
our  overarching  goal  is  to  solve  the  optimization  problem 


min 

X 


1 

n 


Em*) 

i= 1 


such  that  x  £  X , 


(3) 
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Error  versus  sample  size 


Error  versus  rescaled  sample  size 


Figure  2.  Results  of  applying  the  SDP  (1)  with  nuclear  norm  regularization  to  the  problem  of 
low-rank  multivariate  regression,  (a)  Plots  of  the  Frobenius  error  |||0  —  @*|||f  on  a  logarithmic 
scale  versus  the  sample  size  N  for  three  different  matrix  sizes  p  £  {40,80, 160},  all  with  rank 
r  =  10.  (b)  Plots  of  the  same  Frobenius  error  versus  the  rescaled  sample  size  N/  irp).  Consistent 
with  theory,  all  three  plots  are  now  extremely  well-aligned. 


where  X  is  a  closed  convex  set.  Each  function  /*  is  convex  and  hence  sub-differentiable,  but 
need  not  be  smooth.  We  assume  without  loss  of  generality  that  0  £  X,  since  we  can  simply 
translate  X .  Each  node  i  £  V  is  associated  with  a  separate  agent,  and  each  agent  i  maintains 
its  own  parameter  vector  Xi  £  Md.  The  graph  G  imposes  communication  constraints  on 
the  agents:  in  particular,  agent  i  has  local  access  to  only  the  objective  function  /j  and  can 
communicate  directly  only  with  its  immediate  neighbors  j  £  A f(i)  :  =  {j  £  V  |  ( i,j )  £  E}. 


Figure  3.  Illustration  of  some  graph  classes  of  interest  in  distributed  protocols,  (a)  A  3- 
connected  cycle,  (b)  Two-dimensional  grid  with  4-connectivity,  and  non-toroidal  boundary 
conditions,  (c)  A  random  geometric  graph,  (d)  A  random  3-regular  expander  graph. 


Problems  of  this  nature  arise  in  a  variety  of  application  domains,  and  as  motivation  for 
the  analysis  to  follow,  let  us  consider  a  few  here.  A  first  example  is  a  sensor  network,  in 
which  each  agent  represents  a  sensor  mote,  equipped  with  a  radio  transmitter  for  commu¬ 
nication,  some  basic  sensing  devices,  and  some  local  memory  and  computational  power.  In 
environmental  applications  of  sensor  networks,  each  mote  i  might  take  a  measurement  m  of 
the  temperature,  and  the  global  objective  could  be  to  compute  the  median  of  the  measure¬ 
ments  {2/1, 2/2,  •  •  •  ,Un}-  This  median  computation  problem  can  be  formulated  as  minimizing 
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the  scalar  objective  function  ^  Y17=  1  /*(x)>  where  fi(x)  =  \x  —  yi\.  Similar  formulations  apply 
to  the  problem  of  computing  other  statistics  such  as  means,  variances,  quantiles  and  other  M- 
estimators.  Another  motivating  example  is  the  machine  learning  problem  of  classification,  in 
which  X  is  the  parameter  space  of  the  statistician  or  learner.  Each  function  f%  is  the  empirical 
loss  over  the  subset  of  data  assigned  to  the  ith  processor,  and  assuming  that  each  subset  is 
of  equal  size  (or  that  the  /,  are  normalized  suitably),  the  average  /  is  the  empirical  loss  over 
the  entire  dataset.  Here  we  use  cluster  computing  as  our  computational  model,  where  each 
processor  is  a  node  in  the  cluster,  and  the  graph  G  contains  edges  between  those  processors 
that  are  directly  connected  with  small  network  latencies. 

We  consider  an  appropriate  and  novel  extension  of  dual  averaging  to  the  distributed  set¬ 
ting.  At  each  iteration  t  =  1,  2,  3, . . .,  the  algorithm  maintains  n  pairs  of  vectors  ( Xi(t ),  Zi(t))  £ 
X  xMd,  with  the  ith  pair  associated  with  node  i  £  V .  At  iteration  t,  each  node  i  6  h  computes 
an  element  gt(t)  £  dfi(xi(t ))  in  the  subdifferential  of  the  local  function  f,  and  receives  infor¬ 
mation  about  the  parameters  {zj(t),  j  £  AA(i)}  associated  with  nodes  j  in  its  neighborhood 
N(i).  Its  update  of  the  current  estimated  solution  Xi(t)  is  based  on  a  convex  combination 
of  these  parameters.  To  model  this  weighting  process,  let  P  £  Mnxn  be  a  symmetric  matrix 
of  non-negative  weights  that  respects  the  structure  of  the  graph  G ,  meaning  that  for  i  ^  j, 
Pij  >  0  only  if  (i,j)  £  E.  We  assume  that  P  is  a  doubly  stochastic  matrix,  so  that 

n  n 

Pij  =  P^  =  1  for  all  i  £  V,  and  y^  Pij  =  y^  Ptj  =  1  for  all  j  £  V. 

j= i  i= i  ieMij) 

Using  this  notation,  given  the  non-increasing  sequence  {a(t)}^0  of  positive  stepsizes,  each 
node  i  £  V  =  {1,2 , ,n}  performs  the  updates 

Zi(t  +  1)  =  PijZj(t)  -  gi(t),  and  (4a) 

Xi(t  +  1)  =  U$,(-Zi(t  +  (4b) 

where  the  projection  is  orthogonal  projection  onto  X .  Note  that  each  node  obtains  its 
new  dual  parameter  Zi(t  +  1)  from  a  weighted  average  of  its  own  subgradient  gi(t)  and  the 
parameters  {zj(t),j  £  AA(i)}  in  its  own  neighborhood  A f(i),  and  then  computes  the  next  local 
iterate  Xi(t  +  1)  by  a  projection  defined  by  the  proximal  function  V’  and  stepsize  a(t)  >  0. 

We  consider  several  settings  for  distributed  minimization.  We  study  fixed  communica¬ 
tion  protocols,  which  are  of  interest  in  a  variety  of  areas  such  as  cluster  computing  or  sensor 
networks  with  a  fixed  hardware-dependent  protocol.  We  also  investigate  randomized  commu¬ 
nication  protocols  as  well  as  randomized  network  failures,  which  are  often  essential  to  handle 
gracefully  in  wireless  sensor  networks  and  large  clusters  with  potential  node  failures.  Ran¬ 
domized  communication  also  provides  interesting  tradeoffs  between  communication  savings 
and  convergence  rates.  In  this  setting,  we  obtain  much  sharper  results  than  previous  work  by 
studying  the  spectral  properties  of  the  expected  transition  matrix  of  a  random  walk  on  the 
underlying  graph.  We  also  present  a  relatively  straightforward  extension  of  our  analysis  for 
stochastic  gradient  information. 

We  provide  sharp  bounds  on  their  convergence  rates  as  a  function  of  the  network  size  and 
topology.  Our  analysis  clearly  separates  the  convergence  of  the  optimization  algorithm  itself 
from  the  effects  of  communication  constraints  arising  from  the  network  structure.  We  show 
that  the  number  of  iterations  required  by  our  algorithm  scales  inversely  in  the  spectral  gap  of 
the  network.  The  sharpness  of  this  prediction  is  confirmed  both  by  theoretical  lower  bounds 
and  simulations  for  various  networks. 
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5  Sparse  non-parametric  regression  in  high  dimensions 


Sparsity  is  an  attractive  assumption  for  both  practical  and  theoretical  reasons:  it  leads  to 
more  interpretable  models,  reduces  computational  cost,  and  allows  for  model  identifiabil- 
ity  even  under  high-dimensional  scaling,  where  the  dimension  p  exceeds  the  sample  size  N. 
While  a  large  body  of  work  has  focused  on  sparse  linear  models,  many  applications  call  for 
the  additional  flexibility  provided  by  non-parametric  models.  In  the  general  setting,  a  non- 
parametric  regression  model  takes  the  form  y  =  f*{x i, . . . ,  xp)  +  w,  where  f*  :  MP  — >  M  is  the 
unknown  regression  function,  and  w  is  scalar  observation  noise.  Unfortunately,  this  general 
non-parametric  model  is  known  to  suffer  severely  from  the  so-called  “curse  of  dimensionality” , 
in  that  for  most  natural  function  classes  (e.g.,  twice  differentiable  functions),  the  sample  size 
N  required  to  achieve  any  given  error  grows  exponentially  in  the  dimension  p.  Given  this 
curse  of  dimensionality,  it  is  essential  to  further  constrain  the  complexity  of  possible  functions 
/*.  One  attractive  candidate  is  the  class  of  additive  non-parametric  models  [17],  in  which  the 
function  /*  has  an  additive  decomposition  of  the  form 

p 

f*(xi,X2,...,Xp)  =  (5) 

3= 1 

where  each  component  function  f  *  is  univariate.  Given  this  additive  form,  this  function  class 
no  longer  suffers  from  the  exponential  explosion  in  sample  size  of  the  general  non-parametric 
model.  Nonetheless,  one  still  requires  a  sample  size  N  3>  p  for  consistent  estimation;  note 
that  this  is  true  even  for  the  linear  model,  which  is  a  special  case  of  equation  (5). 

A  natural  extension  of  sparse  linear  models  is  the  class  of  sparse  additive  models,  in  which 
the  unknown  regression  function  is  assumed  to  have  a  decomposition  of  the  form 

f*(xi,x2  ■  •  •  ,xp)  =  ^2fj(xj),  (6) 

ieS 

where  S  C  {1,2, ...  ,p}  is  some  unknown  subset  of  cardinality  \S\  =  s.  Of  primary  interest 
is  the  case  when  the  decomposition  is  genuinely  sparse,  so  that  s  <C  p.  To  the  best  of  our 
knowledge,  this  model  class  was  first  introduced  by  [21],  and  has  since  been  studied  by  various 

researchers  [20,  29,  38,  45].  Note  that  the  sparse  additive  model  (6)  is  a  natural  generalization 

of  the  sparse  linear  model,  to  which  it  reduces  when  each  univariate  function  is  constrained 
to  be  linear. 

In  past  work,  several  groups  have  proposed  computationally  efficient  methods  for  estimat¬ 
ing  sparse  additive  models  (6).  Just  as  Id -based  relaxations  such  as  the  Lasso  have  desirable 
properties  for  sparse  parametric  models,  more  general  I^-based  approaches  have  proven  to 
be  successful  in  this  setting.  [21]  proposed  the  COSSO  method,  which  extends  the  Lasso  to 
cases  where  the  component  functions  f*  lie  in  a  reproducing  kernel  Hilbert  space  (RKHS);  see 
also  [45]  for  a  similar  extension  of  the  non- negative  garrote  [8] .  [6]  analyzes  a  closely  related 
method  for  the  RKHS  setting,  in  which  least-squares  loss  is  penalized  by  an  d-sum  of  Hilbert 
norms,  and  establishes  consistency  results  in  the  classical  (fixed  p)  setting.  Other  related  Id- 
based  methods  have  been  proposed  in  independent  work  by  [19],  [38]  and  [29],  and  analyzed 
under  high-dimensional  scaling  (p  3>  N).  Each  of  the  above  papers  establish  consistency  and 
convergence  rates  for  the  prediction  error  under  certain  conditions  on  the  covariates  as  well  as 
the  sparsity  s  and  dimension  p.  However,  it  is  not  clear  whether  the  rates  obtained  in  these 
papers  are  sharp  for  the  given  methods,  nor  whether  the  rates  are  minimax-optimal.  Past 
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work  by  [20]  establishes  rates  for  sparse  additive  models  with  an  additional  global  bound¬ 
edness  condition,  but  as  will  be  discussed  at  more  length  in  the  sequel,  these  rates  are  not 
minimax  optimal  in  general. 

This  paper  makes  three  main  contributions  to  this  line  of  research.  Our  first  contribution 
is  to  analyze  a  simple  polynomial-time  method  for  estimating  sparse  additive  models  and 
provide  upper  bounds  on  the  error  in  the  L2( P)  and  L2(Pn)  norms.  The  estimator1  that 
we  analyze  is  based  on  a  combination  of  least-squares  loss  with  two  G-based  sparsity  penalty 
terms,  one  corresponding  to  an  Li2( Pn)  norm  and  the  other  an  l\/\\ •  ||%  norm.  Our  first  main 
result  shows  that  with  high  probability,  if  we  assume  the  univariate  functions  are  bounded  and 
independent,  the  error  of  our  procedure  in  the  squared  L2(Pn)  and  L2(P)  norms  is  bounded  by 
O  ( kl°j^p  +  sh2 ) ,  where  the  quantity  <52  corresponds  to  the  optimal  rate  for  estimating  a  single 
univariate  function.  Importantly,  our  analysis  does  not  require  a  global  boundedness  condition 
on  the  class  J-s  of  ah  s-sparse  models,  an  assumption  that  is  often  imposed  in  classical  non- 
parametric  analysis.  Indeed,  as  we  discuss  below,  when  such  a  condition  is  imposed,  then 
significantly  faster  rates  of  estimation  are  possible.  The  proof  involves  a  combination  of 
techniques  for  analyzing  M- -estimators  with  decomposable  regularizers  [32]  combined  with 
various  techniques  in  empirical  process  theory  for  analyzing  kernel  classes  [7,  31,  42].  Our 
second  contribution  is  complementary  in  nature,  in  that  it  establishes  algorithm-independent 
minimax  lower  bounds  on  L2(P)  error.  These  minimax  lower  bounds  are  specified  in  terms 
of  the  metric  entropy  of  the  underlying  univariate  function  classes.  For  both  finite-rank 
kernel  classes  and  Sobolev-type  classes,  these  lower  bounds  match  our  achievable  results  up 
to  constant  factors  in  the  regime  of  sub-linear  sparsity  ( s  =  o(p)).  Thus,  for  these  function 
classes,  we  have  a  sharp  characterization  of  the  associated  minimax  rates.  The  lower  bounds 
derived  in  this  paper  initially  appeared  in  the  Proceedings  of  the  NIPS  Conference  (December 
2009).  The  proofs  of  Theorem  2  is  based  on  characterizing  the  packing  entropies  of  the  class 
of  sparse  additive  models,  combined  with  classical  information  theoretic  techniques  involving 
Fano’s  inequality  and  variants  [16,  43,  44], 

6  High-dimensional  inference  with  graphical  models 

Inference  methods  based  on  graphical  models  are  now  prevalent  in  many  fields,  running  the 
gamut  from  computer  vision  and  civil  engineering  to  political  science  and  epidemiology.  In 
many  applications,  learning  the  edge  structure  of  the  underlying  graphical  model  is  relevant 
to  the  researcher — for  instance,  a  graphical  model  may  be  used  to  represent  friendships  be¬ 
tween  people  in  a  social  network,  or  links  between  organisms  with  the  propensity  to  spread 
an  infectious  disease.  It  is  well-known  that  zeros  in  the  inverse  covariance  matrix  of  a  multi¬ 
variate  Gaussian  distribution  indicate  the  absence  of  an  edge  in  the  corresponding  graphical 
model.  This  fact,  combined  with  techniques  in  high-dimensional  statistical  inference,  has  been 
leveraged  by  many  authors  to  recover  the  structure  of  a  Gaussian  graphical  model  when  the 
edge  set  is  sparse  (e.g.,  see  the  papers  [39,  46,  11,  30]  and  references  therein).  Recently,  Liu  et 
al.  [23,  22]  introduced  the  notion  of  a  nonparanormal  distribution,  which  generalizes  the  Gaus¬ 
sian  distribution  by  allowing  for  univariate  monotonic  transformations,  and  argued  that  the 
same  structural  properties  of  the  inverse  covariance  matrix  carry  over  to  the  nonparanormal. 

However,  the  question  of  whether  there  exists  a  relationship  between  conditional  indepen¬ 
dence  and  the  structure  of  the  inverse  covariance  matrix  in  a  general  graph  remains  unresolved. 
In  this  paper,  we  focus  on  discrete  graphical  models,  and  establish  a  number  of  interesting 

lrriie  same  estimator  was  proposed  concurrently  by  [20];  we  discuss  connections  to  this  work  in  the  sequel. 
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links  between  covariance  matrices  and  the  edge  structure  of  the  underlying  graph.  We  show 
that,  instead  of  only  analyzing  the  standard  covariance  matrix,  it  is  often  fruitful  to  augment 
the  usual  covariance  matrix  with  higher-order  interaction  terms.  Our  main  result  has  a  strik¬ 
ing  corollary  in  the  context  of  tree-structured  graphs:  for  any  discrete  graphical  model,  the 
inverse  of  a  generalized  covariance  matrix  is  always  (block)  graph-structured.  In  particular, 
for  binary  variables,  the  inverse  of  the  usual  covariance  matrix  reflects  the  zero-pattern  of  the 
tree.  We  also  establish  a  number  of  more  general  results  that  apply  to  non-tree-structured 
graphs,  and  derive  several  corollaries  that  guarantee  consistency  of  known  methods  for  graph 
selection  and  lead  to  a  new  method  for  neighborhood  selection  in  an  arbitrary  sparse  graph. 
Our  methods  handle  noisy  or  missing  data  in  a  seamless  manner. 

Other  related  work  on  graphical  model  selection  for  discrete  graphs  includes  the  clas¬ 
sic  Chow-Liu  algorithm  for  trees  [12],  nodewise  logistic  regression  for  discrete  models  with 
pairwise  interactions  [40,  18],  and  techniques  based  on  conditional  entropy  or  mutual  infor¬ 
mation  [9,  4],  Our  main  contribution  is  to  present  a  clean  and  surprising  result  on  a  simple 
link  between  the  inverse  covariance  matrix  and  edge  structure  of  a  discrete  model,  which  may 
be  used  to  derive  inference  algorithms  applicable  even  to  data  with  systematic  corruptions. 


(a)  Dino  graph  with  missing  data  (b)  Chain  graph  with  missing  data  (c)  Star  graph 

Figure  4.  Simulation  results  for  global  and  nodewise  recovery  methods  on  binary  Ising  models. 

Panel  (a)  shows  simulation  results  for  the  log-determinant  method  applied  to  the  dinosaur 
graph,  averaged  over  50  trials.  Panel  (b)  shows  simulation  results  for  nodewise  regression 
applied  to  chain  graphs  with  different  numbers  of  nodes,  averaged  over  50  trials.  Panel  (c) 
shows  simulation  results  for  nodewise  regression  applied  to  star  graphs  with  maximal  node 
degree  ,/p,  averaged  over  100  trials.  The  horizontal  axis  gives  the  rescaled  sample  size  d/gp- 

Figure  4  depicts  the  results  of  several  simulations  we  performed  to  test  our  theoretical 
predictions.  In  all  cases,  we  generated  binary  Ising  models  with  node  weights  0.1  and  edge 
weights  0.3  (using  spin  {—1, 1}  variables).  Panel  (a)  shows  the  results  of  our  global  recovery 
method  applied  to  the  dinosaur  graph.  The  solid  curve  shows  the  probability  of  success  in 
recovering  the  15  edges  of  the  graph,  as  a  function  of  the  rescaled  sample  size  where  p  = 

13.  The  dotted  curves  show  the  corresponding  success  probabilities  for  missing  data  fractions 
p  £  {0.1,0.15,0.2},  using  the  corrected  estimator.  We  observe  that  all  four  runs  display 
a  transition  from  success  probability  0  to  success  probability  1  in  roughly  the  same  range, 
as  predicted  by  our  theory.  Indeed,  since  the  dinosaur  graph  has  only  singleton  separators, 
our  theory  ensures  that  the  inverse  covariance  matrix  is  exactly  graph-structured.  Note  that 
the  curves  shift  right  as  the  fraction  p  of  missing  data  increases,  since  the  problem  becomes 
harder. 

Panels  (b)  and  (c)  show  simulation  results  for  our  nodewise  regression  method  on  chain 
and  star  graphs,  with  increasing  numbers  of  nodes  p  £  {16,32,64}.  The  modified  Lasso 
program  was  optimized  using  a  form  of  composite  gradient  descent  due  to  Agarwal  et  al.  [3], 


guaranteed  to  converge  to  a  small  neighborhood  of  the  optimum  even  when  the  problem  is 
non-convex  [27].  Our  theory  predicts  that,  when  plotted  against  rescaled  sample  size 
the  curves  for  different  problems  should  be  roughly  aligned,  which  we  observe  for  both  fully- 
observed  samples  (solid  lines)  and  missing  data  (dotted  lines).  Again,  the  curves  for  missing 
data  (p  =  0.1)  are  shifted  right  relative  to  the  curves  for  fully-observed  data,  since  the  recovery 
problem  becomes  harder  with  a  higher  fraction  of  missing  data.  Panel  (c)  shows  similar  curves 
for  fully-observed  samples  from  a  star  graph,  where  the  central  hub  has  degree  yjp.  Here,  we 
use  the  rescaled  sample  size  ,  since  the  degree  is  growing. 
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